Greetings! Today we analyse dataset: Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning. You can downlodad ata here: https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression

Tasks

Tasks are: 1. Give a dataset description:

-how many mice?

-how many groups?

-are these groups balanced?

-check for NA

  1. Perform ANOVA for BDNF_N~class

  2. Make a linear model for ERBB4_N prediction on the basis of other proteins data:

-conduct diagnostics of the model;

-is it good solution?

  1. Perform PCA:

-make an ordination;

-plot scores;

-find ‘% explained’ for each PC;

-plot 3D plot for first 3 PCs.

0. Loading libraries

pckgs <- c('tidyverse', 'ggplot2', 'FactoMineR', 'multcompView', 'vegan', 'plotly','psych','car','gridExtra','ggpubr','rstatix','RColorBrewer')
for(i in pckgs){
  if(!require(i, character.only = T)){
    install.packages(i, dependencies = T)
    library(i)
  }
}

1.Data manipulating

data_mice <- readxl::read_xls('Data_Cortex_Nuclear.xls')

According to the dataset description, there are 72 mice with 15 experiments each. Working with such a complex ID(e.g. 309_1) is not so convinient,so I decided to split Mouse_ID column into 2 columns: Mouse_ID and Experiment

data_mice <- separate(data_mice,col = 1,into=c('Mouse_ID','Experiment'),sep = '_',remove=T)

We also need to change type of some variables:

data_mice$Mouse_ID <- as.factor(data_mice$Mouse_ID)
data_mice$Experiment <- as.factor(data_mice$Experiment)
data_mice$class <- as.factor(data_mice$class)
data_mice$Genotype <- as.factor(data_mice$Genotype)
data_mice$Treatment <- as.factor(data_mice$Treatment)
data_mice$Behavior <- as.factor(data_mice$Behavior)

P.S. I found this elegant solution to get rid fo manual changing data type here: https://gist.github.com/ramhiser/93fe37be439c480dc26c4bed8aab03dd

data_mice <- data_mice %>% mutate_if(is.character,as.factor)

1.1 How many mice?

Let’s verify mice count (we could also take a look at the structure):

mouse_count <- length(unique(data_mice$Mouse_ID))
str(data_mice)
## tibble [1,080 × 83] (S3: tbl_df/tbl/data.frame)
##  $ Mouse_ID       : Factor w/ 72 levels "18899","293",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Experiment     : Factor w/ 15 levels "1","10","11",..: 1 8 9 10 11 12 13 14 15 2 ...
##  $ DYRK1A_N       : num [1:1080] 0.504 0.515 0.509 0.442 0.435 ...
##  $ ITSN1_N        : num [1:1080] 0.747 0.689 0.73 0.617 0.617 ...
##  $ BDNF_N         : num [1:1080] 0.43 0.412 0.418 0.359 0.359 ...
##  $ NR1_N          : num [1:1080] 2.82 2.79 2.69 2.47 2.37 ...
##  $ NR2A_N         : num [1:1080] 5.99 5.69 5.62 4.98 4.72 ...
##  $ pAKT_N         : num [1:1080] 0.219 0.212 0.209 0.223 0.213 ...
##  $ pBRAF_N        : num [1:1080] 0.178 0.173 0.176 0.176 0.174 ...
##  $ pCAMKII_N      : num [1:1080] 2.37 2.29 2.28 2.15 2.13 ...
##  $ pCREB_N        : num [1:1080] 0.232 0.227 0.23 0.207 0.192 ...
##  $ pELK_N         : num [1:1080] 1.75 1.6 1.56 1.6 1.5 ...
##  $ pERK_N         : num [1:1080] 0.688 0.695 0.677 0.583 0.551 ...
##  $ pJNK_N         : num [1:1080] 0.306 0.299 0.291 0.297 0.287 ...
##  $ PKCA_N         : num [1:1080] 0.403 0.386 0.381 0.377 0.364 ...
##  $ pMEK_N         : num [1:1080] 0.297 0.281 0.282 0.314 0.278 ...
##  $ pNR1_N         : num [1:1080] 1.022 0.957 1.004 0.875 0.865 ...
##  $ pNR2A_N        : num [1:1080] 0.606 0.588 0.602 0.52 0.508 ...
##  $ pNR2B_N        : num [1:1080] 1.88 1.73 1.73 1.57 1.48 ...
##  $ pPKCAB_N       : num [1:1080] 2.31 2.04 2.02 2.13 2.01 ...
##  $ pRSK_N         : num [1:1080] 0.442 0.445 0.468 0.478 0.483 ...
##  $ AKT_N          : num [1:1080] 0.859 0.835 0.814 0.728 0.688 ...
##  $ BRAF_N         : num [1:1080] 0.416 0.4 0.4 0.386 0.368 ...
##  $ CAMKII_N       : num [1:1080] 0.37 0.356 0.368 0.363 0.355 ...
##  $ CREB_N         : num [1:1080] 0.179 0.174 0.174 0.179 0.175 ...
##  $ ELK_N          : num [1:1080] 1.87 1.76 1.77 1.29 1.32 ...
##  $ ERK_N          : num [1:1080] 3.69 3.49 3.57 2.97 2.9 ...
##  $ GSK3B_N        : num [1:1080] 1.54 1.51 1.5 1.42 1.36 ...
##  $ JNK_N          : num [1:1080] 0.265 0.256 0.26 0.26 0.251 ...
##  $ MEK_N          : num [1:1080] 0.32 0.304 0.312 0.279 0.274 ...
##  $ TRKA_N         : num [1:1080] 0.814 0.781 0.785 0.734 0.703 ...
##  $ RSK_N          : num [1:1080] 0.166 0.157 0.161 0.162 0.155 ...
##  $ APP_N          : num [1:1080] 0.454 0.431 0.423 0.411 0.399 ...
##  $ Bcatenin_N     : num [1:1080] 3.04 2.92 2.94 2.5 2.46 ...
##  $ SOD1_N         : num [1:1080] 0.37 0.342 0.344 0.345 0.329 ...
##  $ MTOR_N         : num [1:1080] 0.459 0.424 0.425 0.429 0.409 ...
##  $ P38_N          : num [1:1080] 0.335 0.325 0.325 0.33 0.313 ...
##  $ pMTOR_N        : num [1:1080] 0.825 0.762 0.757 0.747 0.692 ...
##  $ DSCR1_N        : num [1:1080] 0.577 0.545 0.544 0.547 0.537 ...
##  $ AMPKA_N        : num [1:1080] 0.448 0.421 0.405 0.387 0.361 ...
##  $ NR2B_N         : num [1:1080] 0.586 0.545 0.553 0.548 0.513 ...
##  $ pNUMB_N        : num [1:1080] 0.395 0.368 0.364 0.367 0.352 ...
##  $ RAPTOR_N       : num [1:1080] 0.34 0.322 0.313 0.328 0.312 ...
##  $ TIAM1_N        : num [1:1080] 0.483 0.455 0.447 0.443 0.419 ...
##  $ pP70S6_N       : num [1:1080] 0.294 0.276 0.257 0.399 0.393 ...
##  $ NUMB_N         : num [1:1080] 0.182 0.182 0.184 0.162 0.16 ...
##  $ P70S6_N        : num [1:1080] 0.843 0.848 0.856 0.76 0.768 ...
##  $ pGSK3B_N       : num [1:1080] 0.193 0.195 0.201 0.184 0.186 ...
##  $ pPKCG_N        : num [1:1080] 1.44 1.44 1.52 1.61 1.65 ...
##  $ CDK5_N         : num [1:1080] 0.295 0.294 0.302 0.296 0.297 ...
##  $ S6_N           : num [1:1080] 0.355 0.355 0.386 0.291 0.309 ...
##  $ ADARB1_N       : num [1:1080] 1.34 1.31 1.28 1.2 1.21 ...
##  $ AcetylH3K9_N   : num [1:1080] 0.17 0.171 0.185 0.16 0.165 ...
##  $ RRP1_N         : num [1:1080] 0.159 0.158 0.149 0.166 0.161 ...
##  $ BAX_N          : num [1:1080] 0.189 0.185 0.191 0.185 0.188 ...
##  $ ARC_N          : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
##  $ ERBB4_N        : num [1:1080] 0.145 0.15 0.145 0.141 0.142 ...
##  $ nNOS_N         : num [1:1080] 0.177 0.178 0.176 0.164 0.168 ...
##  $ Tau_N          : num [1:1080] 0.125 0.134 0.133 0.123 0.137 ...
##  $ GFAP_N         : num [1:1080] 0.115 0.118 0.118 0.117 0.116 ...
##  $ GluR3_N        : num [1:1080] 0.228 0.238 0.245 0.235 0.256 ...
##  $ GluR4_N        : num [1:1080] 0.143 0.142 0.142 0.145 0.141 ...
##  $ IL1B_N         : num [1:1080] 0.431 0.457 0.51 0.431 0.481 ...
##  $ P3525_N        : num [1:1080] 0.248 0.258 0.255 0.251 0.252 ...
##  $ pCASP9_N       : num [1:1080] 1.6 1.67 1.66 1.48 1.53 ...
##  $ PSD95_N        : num [1:1080] 2.01 2 2.02 1.96 2.01 ...
##  $ SNCA_N         : num [1:1080] 0.108 0.11 0.108 0.12 0.12 ...
##  $ Ubiquitin_N    : num [1:1080] 1.045 1.01 0.997 0.99 0.998 ...
##  $ pGSK3B_Tyr216_N: num [1:1080] 0.832 0.849 0.847 0.833 0.879 ...
##  $ SHH_N          : num [1:1080] 0.189 0.2 0.194 0.192 0.206 ...
##  $ BAD_N          : num [1:1080] 0.123 0.117 0.119 0.133 0.13 ...
##  $ BCL2_N         : num [1:1080] NA NA NA NA NA NA NA NA NA NA ...
##  $ pS6_N          : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
##  $ pCFOS_N        : num [1:1080] 0.108 0.104 0.106 0.111 0.111 ...
##  $ SYP_N          : num [1:1080] 0.427 0.442 0.436 0.392 0.434 ...
##  $ H3AcK18_N      : num [1:1080] 0.115 0.112 0.112 0.13 0.118 ...
##  $ EGR1_N         : num [1:1080] 0.132 0.135 0.133 0.147 0.14 ...
##  $ H3MeK4_N       : num [1:1080] 0.128 0.131 0.127 0.147 0.148 ...
##  $ CaNA_N         : num [1:1080] 1.68 1.74 1.93 1.7 1.84 ...
##  $ Genotype       : Factor w/ 2 levels "Control","Ts65Dn": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment      : Factor w/ 2 levels "Memantine","Saline": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Behavior       : Factor w/ 2 levels "C/S","S/C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ class          : Factor w/ 8 levels "c-CS-m","c-CS-s",..: 1 1 1 1 1 1 1 1 1 1 ...

Yep, 72 mice.

1.2/ 1.3 How many groups contains our dataset?/ Are they balanced?

There are 8 classes (last column in dataset):

c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)

c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)

c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)

c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)

t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)

t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)

t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)

t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)

Our groups are almost the same size, I think it wouldn’t be a problem in future anlysis.

How to verify it (if you wish)? Using this code you can easily confirm dataset description:

data_mice %>% 
  group_by(class) %>% summarise(number = n() / 15)
## # A tibble: 8 x 2
##   class  number
## * <fct>   <dbl>
## 1 c-CS-m     10
## 2 c-CS-s      9
## 3 c-SC-m     10
## 4 c-SC-s      9
## 5 t-CS-m      9
## 6 t-CS-s      7
## 7 t-SC-m      9
## 8 t-SC-s      9

1.4 Missing values?

How many NA’s we have? There are 2 ways of checking:

table(is.na(data_mice))
## 
## FALSE  TRUE 
## 88244  1396
summary(data_mice)
##     Mouse_ID     Experiment     DYRK1A_N         ITSN1_N           BDNF_N      
##  18899  : 15   1      : 72   Min.   :0.1453   Min.   :0.2454   Min.   :0.1152  
##  293    : 15   10     : 72   1st Qu.:0.2881   1st Qu.:0.4734   1st Qu.:0.2874  
##  294    : 15   11     : 72   Median :0.3664   Median :0.5658   Median :0.3166  
##  309    : 15   12     : 72   Mean   :0.4258   Mean   :0.6171   Mean   :0.3191  
##  311    : 15   13     : 72   3rd Qu.:0.4877   3rd Qu.:0.6980   3rd Qu.:0.3482  
##  320    : 15   14     : 72   Max.   :2.5164   Max.   :2.6027   Max.   :0.4972  
##  (Other):990   (Other):648   NA's   :3        NA's   :3        NA's   :3       
##      NR1_N           NR2A_N          pAKT_N           pBRAF_N       
##  Min.   :1.331   Min.   :1.738   Min.   :0.06324   Min.   :0.06404  
##  1st Qu.:2.057   1st Qu.:3.156   1st Qu.:0.20575   1st Qu.:0.16459  
##  Median :2.297   Median :3.761   Median :0.23118   Median :0.18230  
##  Mean   :2.297   Mean   :3.844   Mean   :0.23317   Mean   :0.18185  
##  3rd Qu.:2.528   3rd Qu.:4.440   3rd Qu.:0.25726   3rd Qu.:0.19742  
##  Max.   :3.758   Max.   :8.483   Max.   :0.53905   Max.   :0.31707  
##  NA's   :3       NA's   :3       NA's   :3         NA's   :3        
##    pCAMKII_N        pCREB_N           pELK_N          pERK_N      
##  Min.   :1.344   Min.   :0.1128   Min.   :0.429   Min.   :0.1492  
##  1st Qu.:2.480   1st Qu.:0.1908   1st Qu.:1.204   1st Qu.:0.3374  
##  Median :3.327   Median :0.2106   Median :1.356   Median :0.4436  
##  Mean   :3.537   Mean   :0.2126   Mean   :1.429   Mean   :0.5459  
##  3rd Qu.:4.482   3rd Qu.:0.2346   3rd Qu.:1.561   3rd Qu.:0.6633  
##  Max.   :7.464   Max.   :0.3062   Max.   :6.113   Max.   :3.5667  
##  NA's   :3       NA's   :3        NA's   :3       NA's   :3       
##      pJNK_N            PKCA_N           pMEK_N            pNR1_N      
##  Min.   :0.05211   Min.   :0.1914   Min.   :0.05682   Min.   :0.5002  
##  1st Qu.:0.28124   1st Qu.:0.2818   1st Qu.:0.24429   1st Qu.:0.7435  
##  Median :0.32133   Median :0.3130   Median :0.27739   Median :0.8211  
##  Mean   :0.31351   Mean   :0.3179   Mean   :0.27503   Mean   :0.8258  
##  3rd Qu.:0.34871   3rd Qu.:0.3523   3rd Qu.:0.30345   3rd Qu.:0.8985  
##  Max.   :0.49343   Max.   :0.4740   Max.   :0.45800   Max.   :1.4082  
##  NA's   :3         NA's   :3        NA's   :3         NA's   :3       
##     pNR2A_N          pNR2B_N          pPKCAB_N          pRSK_N       
##  Min.   :0.2813   Min.   :0.3016   Min.   :0.5678   Min.   :0.09594  
##  1st Qu.:0.5903   1st Qu.:1.3813   1st Qu.:1.1683   1st Qu.:0.40414  
##  Median :0.7196   Median :1.5637   Median :1.3657   Median :0.44060  
##  Mean   :0.7269   Mean   :1.5620   Mean   :1.5253   Mean   :0.44285  
##  3rd Qu.:0.8486   3rd Qu.:1.7485   3rd Qu.:1.8859   3rd Qu.:0.48210  
##  Max.   :1.4128   Max.   :2.7240   Max.   :3.0614   Max.   :0.65096  
##  NA's   :3        NA's   :3        NA's   :3        NA's   :3        
##      AKT_N             BRAF_N          CAMKII_N          CREB_N      
##  Min.   :0.06442   Min.   :0.1439   Min.   :0.2130   Min.   :0.1136  
##  1st Qu.:0.59682   1st Qu.:0.2643   1st Qu.:0.3309   1st Qu.:0.1618  
##  Median :0.68247   Median :0.3267   Median :0.3603   Median :0.1796  
##  Mean   :0.68224   Mean   :0.3785   Mean   :0.3634   Mean   :0.1805  
##  3rd Qu.:0.75969   3rd Qu.:0.4136   3rd Qu.:0.3939   3rd Qu.:0.1957  
##  Max.   :1.18217   Max.   :2.1334   Max.   :0.5862   Max.   :0.3196  
##  NA's   :3         NA's   :3        NA's   :3        NA's   :3       
##      ELK_N            ERK_N          GSK3B_N           JNK_N       
##  Min.   :0.4977   Min.   :1.132   Min.   :0.1511   Min.   :0.0463  
##  1st Qu.:0.9444   1st Qu.:1.992   1st Qu.:1.0231   1st Qu.:0.2204  
##  Median :1.0962   Median :2.401   Median :1.1598   Median :0.2449  
##  Mean   :1.1734   Mean   :2.474   Mean   :1.1726   Mean   :0.2416  
##  3rd Qu.:1.3236   3rd Qu.:2.873   3rd Qu.:1.3097   3rd Qu.:0.2633  
##  Max.   :2.8029   Max.   :5.198   Max.   :2.4758   Max.   :0.3872  
##  NA's   :18       NA's   :3       NA's   :3        NA's   :3       
##      MEK_N            TRKA_N           RSK_N            APP_N       
##  Min.   :0.1472   Min.   :0.1987   Min.   :0.1074   Min.   :0.2356  
##  1st Qu.:0.2471   1st Qu.:0.6171   1st Qu.:0.1496   1st Qu.:0.3663  
##  Median :0.2734   Median :0.7050   Median :0.1667   Median :0.4020  
##  Mean   :0.2728   Mean   :0.6932   Mean   :0.1684   Mean   :0.4048  
##  3rd Qu.:0.3008   3rd Qu.:0.7742   3rd Qu.:0.1845   3rd Qu.:0.4419  
##  Max.   :0.4154   Max.   :1.0016   Max.   :0.3051   Max.   :0.6327  
##  NA's   :7        NA's   :3        NA's   :3        NA's   :3       
##    Bcatenin_N        SOD1_N           MTOR_N           P38_N       
##  Min.   :1.135   Min.   :0.2171   Min.   :0.2011   Min.   :0.2279  
##  1st Qu.:1.827   1st Qu.:0.3196   1st Qu.:0.4104   1st Qu.:0.3520  
##  Median :2.115   Median :0.4441   Median :0.4525   Median :0.4078  
##  Mean   :2.147   Mean   :0.5426   Mean   :0.4525   Mean   :0.4153  
##  3rd Qu.:2.424   3rd Qu.:0.6958   3rd Qu.:0.4880   3rd Qu.:0.4663  
##  Max.   :3.681   Max.   :1.8729   Max.   :0.6767   Max.   :0.9333  
##  NA's   :18      NA's   :3        NA's   :3        NA's   :3       
##     pMTOR_N          DSCR1_N          AMPKA_N           NR2B_N      
##  Min.   :0.1666   Min.   :0.1553   Min.   :0.2264   Min.   :0.1848  
##  1st Qu.:0.6835   1st Qu.:0.5309   1st Qu.:0.3266   1st Qu.:0.5149  
##  Median :0.7608   Median :0.5767   Median :0.3585   Median :0.5635  
##  Mean   :0.7590   Mean   :0.5852   Mean   :0.3684   Mean   :0.5653  
##  3rd Qu.:0.8415   3rd Qu.:0.6344   3rd Qu.:0.4008   3rd Qu.:0.6145  
##  Max.   :1.1249   Max.   :0.9164   Max.   :0.7008   Max.   :0.9720  
##  NA's   :3        NA's   :3        NA's   :3        NA's   :3       
##     pNUMB_N          RAPTOR_N         TIAM1_N          pP70S6_N     
##  Min.   :0.1856   Min.   :0.1948   Min.   :0.2378   Min.   :0.1311  
##  1st Qu.:0.3128   1st Qu.:0.2761   1st Qu.:0.3720   1st Qu.:0.2811  
##  Median :0.3474   Median :0.3049   Median :0.4072   Median :0.3777  
##  Mean   :0.3571   Mean   :0.3158   Mean   :0.4186   Mean   :0.3945  
##  3rd Qu.:0.3927   3rd Qu.:0.3473   3rd Qu.:0.4560   3rd Qu.:0.4811  
##  Max.   :0.6311   Max.   :0.5267   Max.   :0.7221   Max.   :1.1292  
##  NA's   :3        NA's   :3        NA's   :3        NA's   :3       
##      NUMB_N          P70S6_N          pGSK3B_N          pPKCG_N      
##  Min.   :0.1180   Min.   :0.3441   Min.   :0.09998   Min.   :0.5988  
##  1st Qu.:0.1593   1st Qu.:0.8267   1st Qu.:0.14925   1st Qu.:1.2968  
##  Median :0.1782   Median :0.9313   Median :0.16021   Median :1.6646  
##  Mean   :0.1811   Mean   :0.9431   Mean   :0.16121   Mean   :1.7066  
##  3rd Qu.:0.1972   3rd Qu.:1.0451   3rd Qu.:0.17174   3rd Qu.:2.1130  
##  Max.   :0.3166   Max.   :1.6800   Max.   :0.25321   Max.   :3.3820  
##                                                                      
##      CDK5_N            S6_N           ADARB1_N       AcetylH3K9_N    
##  Min.   :0.1812   Min.   :0.1302   Min.   :0.5291   Min.   :0.05253  
##  1st Qu.:0.2726   1st Qu.:0.3167   1st Qu.:0.9305   1st Qu.:0.10357  
##  Median :0.2938   Median :0.4010   Median :1.1283   Median :0.15042  
##  Mean   :0.2924   Mean   :0.4292   Mean   :1.1974   Mean   :0.21648  
##  3rd Qu.:0.3125   3rd Qu.:0.5349   3rd Qu.:1.3802   3rd Qu.:0.26965  
##  Max.   :0.8174   Max.   :0.8226   Max.   :2.5399   Max.   :1.45939  
##                                                                      
##      RRP1_N             BAX_N             ARC_N            ERBB4_N      
##  Min.   :-0.06201   Min.   :0.07233   Min.   :0.06725   Min.   :0.1002  
##  1st Qu.: 0.14902   1st Qu.:0.16817   1st Qu.:0.11084   1st Qu.:0.1470  
##  Median : 0.16210   Median :0.18074   Median :0.12163   Median :0.1564  
##  Mean   : 0.16663   Mean   :0.17931   Mean   :0.12152   Mean   :0.1565  
##  3rd Qu.: 0.17741   3rd Qu.:0.19158   3rd Qu.:0.13196   3rd Qu.:0.1654  
##  Max.   : 0.61238   Max.   :0.24114   Max.   :0.15875   Max.   :0.2087  
##                                                                         
##      nNOS_N            Tau_N             GFAP_N           GluR3_N      
##  Min.   :0.09973   Min.   :0.09623   Min.   :0.08611   Min.   :0.1114  
##  1st Qu.:0.16645   1st Qu.:0.16799   1st Qu.:0.11277   1st Qu.:0.1957  
##  Median :0.18267   Median :0.18863   Median :0.12046   Median :0.2169  
##  Mean   :0.18130   Mean   :0.21049   Mean   :0.12089   Mean   :0.2219  
##  3rd Qu.:0.19857   3rd Qu.:0.23394   3rd Qu.:0.12772   3rd Qu.:0.2460  
##  Max.   :0.26074   Max.   :0.60277   Max.   :0.21362   Max.   :0.3310  
##                                                                        
##     GluR4_N            IL1B_N          P3525_N          pCASP9_N     
##  Min.   :0.07258   Min.   :0.2840   Min.   :0.2074   Min.   :0.8532  
##  1st Qu.:0.10889   1st Qu.:0.4756   1st Qu.:0.2701   1st Qu.:1.3756  
##  Median :0.12355   Median :0.5267   Median :0.2906   Median :1.5227  
##  Mean   :0.12656   Mean   :0.5273   Mean   :0.2913   Mean   :1.5483  
##  3rd Qu.:0.14195   3rd Qu.:0.5770   3rd Qu.:0.3116   3rd Qu.:1.7131  
##  Max.   :0.53700   Max.   :0.8897   Max.   :0.4437   Max.   :2.5862  
##                                                                      
##     PSD95_N          SNCA_N        Ubiquitin_N     pGSK3B_Tyr216_N 
##  Min.   :1.206   Min.   :0.1012   Min.   :0.7507   Min.   :0.5774  
##  1st Qu.:2.079   1st Qu.:0.1428   1st Qu.:1.1163   1st Qu.:0.7937  
##  Median :2.242   Median :0.1575   Median :1.2366   Median :0.8499  
##  Mean   :2.235   Mean   :0.1598   Mean   :1.2393   Mean   :0.8488  
##  3rd Qu.:2.420   3rd Qu.:0.1733   3rd Qu.:1.3631   3rd Qu.:0.9162  
##  Max.   :2.878   Max.   :0.2576   Max.   :1.8972   Max.   :1.2046  
##                                                                    
##      SHH_N            BAD_N            BCL2_N            pS6_N        
##  Min.   :0.1559   Min.   :0.0883   Min.   :0.08066   Min.   :0.06725  
##  1st Qu.:0.2064   1st Qu.:0.1364   1st Qu.:0.11555   1st Qu.:0.11084  
##  Median :0.2240   Median :0.1523   Median :0.12947   Median :0.12163  
##  Mean   :0.2267   Mean   :0.1579   Mean   :0.13476   Mean   :0.12152  
##  3rd Qu.:0.2417   3rd Qu.:0.1740   3rd Qu.:0.14823   3rd Qu.:0.13196  
##  Max.   :0.3583   Max.   :0.2820   Max.   :0.26151   Max.   :0.15875  
##                   NA's   :213      NA's   :285                        
##     pCFOS_N            SYP_N          H3AcK18_N           EGR1_N      
##  Min.   :0.08542   Min.   :0.2586   Min.   :0.07969   Min.   :0.1055  
##  1st Qu.:0.11351   1st Qu.:0.3981   1st Qu.:0.12585   1st Qu.:0.1551  
##  Median :0.12652   Median :0.4485   Median :0.15824   Median :0.1749  
##  Mean   :0.13105   Mean   :0.4461   Mean   :0.16961   Mean   :0.1831  
##  3rd Qu.:0.14365   3rd Qu.:0.4908   3rd Qu.:0.19788   3rd Qu.:0.2045  
##  Max.   :0.25653   Max.   :0.7596   Max.   :0.47976   Max.   :0.3607  
##  NA's   :75                         NA's   :180       NA's   :210     
##     H3MeK4_N          CaNA_N          Genotype       Treatment   Behavior 
##  Min.   :0.1018   Min.   :0.5865   Control:570   Memantine:570   C/S:525  
##  1st Qu.:0.1651   1st Qu.:1.0814   Ts65Dn :510   Saline   :510   S/C:555  
##  Median :0.1940   Median :1.3174                                          
##  Mean   :0.2054   Mean   :1.3378                                          
##  3rd Qu.:0.2352   3rd Qu.:1.5858                                          
##  Max.   :0.4139   Max.   :2.1298                                          
##  NA's   :270                                                              
##      class    
##  c-CS-m :150  
##  c-SC-m :150  
##  c-CS-s :135  
##  c-SC-s :135  
##  t-CS-m :135  
##  t-SC-m :135  
##  (Other):240

There are 1396 NAs. i don’t want to lose any data for any protein. It should be noted that some proteins measurements data contains extreme amount of NAs. We could simply remove them entirely, but for some proteins the amount of data will shrink drastically.We would do some other stuff: we will replace NAs with means according to class (not the whole column!This type of change (using the whole column) would be a disaster in biological sense). Replacing with 0’s will significantly change other stats.

data_mice[,c(3:79,83)] <- data_mice[,c(3:79,83)] %>% group_by(class) %>% 
  mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE),.)))

Let’s check if we successfully replaced NAs:

table(is.na(data_mice))
## 
## FALSE 
## 89640

No NAs found. Now we’re talking.

2. Difference in BDNF_N level by class

2.0. Check for distribution

Let’s have a look at BDNF_N level of each class:

ggplot(data_mice, aes(class, BDNF_N)) + geom_boxplot(aes(fill = class)) + theme_bw() + scale_fill_brewer(palette="Accent") + ggtitle('BDNF_N level plot')

First of all, we should check if BDNF_N data is normally distributed:

shapiro.test(data_mice$BDNF_N)
## 
##  Shapiro-Wilk normality test
## 
## data:  data_mice$BDNF_N
## W = 0.99261, p-value = 3.211e-05

Far from normal distribution. But let’s try to use one-way ANOVA despite of this fact,and after then let’s have a look at QQplot of standardized resilduals. If QQplot would be good, so our ANOVA analysis could be applied. It should be noted that we must use pairwise comparison tests after ANOVA. I chose to conduct Tukey post-hoc test.

2.1. ANOVA & Tukey test

anova1 <- aov(BDNF_N~class,data_mice)
posthoc1 <- TukeyHSD(x=anova1)
posthoc1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = BDNF_N ~ class, data = data_mice)
## 
## $class
##                        diff          lwr           upr     p adj
## c-CS-s-c-CS-m  0.0030979096 -0.013720984  0.0199168033 0.9992916
## c-SC-m-c-CS-m -0.0482716607 -0.064641970 -0.0319013517 0.0000000
## c-SC-s-c-CS-m -0.0258249025 -0.042643796 -0.0090060088 0.0000951
## t-CS-m-c-CS-m -0.0264852092 -0.043304103 -0.0096663155 0.0000538
## t-CS-s-c-CS-m -0.0337569838 -0.051796186 -0.0157177818 0.0000005
## t-SC-m-c-CS-m -0.0181541101 -0.034973004 -0.0013352164 0.0238955
## t-SC-s-c-CS-m -0.0136310252 -0.030449919  0.0031878685 0.2134012
## c-SC-m-c-CS-s -0.0513695703 -0.068188464 -0.0345506766 0.0000000
## c-SC-s-c-CS-s -0.0289228121 -0.046178633 -0.0116669913 0.0000116
## t-CS-m-c-CS-s -0.0295831188 -0.046838940 -0.0123272979 0.0000064
## t-CS-s-c-CS-s -0.0368548934 -0.055302142 -0.0184076449 0.0000001
## t-SC-m-c-CS-s -0.0212520197 -0.038507841 -0.0039961989 0.0047693
## t-SC-s-c-CS-s -0.0167289348 -0.033984756  0.0005268861 0.0651201
## c-SC-s-c-SC-m  0.0224467582  0.005627864  0.0392656519 0.0013973
## t-CS-m-c-SC-m  0.0217864515  0.004967558  0.0386053452 0.0022588
## t-CS-s-c-SC-m  0.0145146769 -0.003524525  0.0325538789 0.2215086
## t-SC-m-c-SC-m  0.0301175506  0.013298657  0.0469364443 0.0000019
## t-SC-s-c-SC-m  0.0346406355  0.017821742  0.0514595292 0.0000000
## t-CS-m-c-SC-s -0.0006603067 -0.017916128  0.0165955142 1.0000000
## t-CS-s-c-SC-s -0.0079320813 -0.026379330  0.0105151672 0.8967905
## t-SC-m-c-SC-s  0.0076707924 -0.009585028  0.0249266133 0.8792631
## t-SC-s-c-SC-s  0.0121938774 -0.005061943  0.0294496982 0.3859175
## t-CS-s-t-CS-m -0.0072717746 -0.025719023  0.0111754738 0.9328307
## t-SC-m-t-CS-m  0.0083310991 -0.008924722  0.0255869199 0.8253159
## t-SC-s-t-CS-m  0.0128541840 -0.004401637  0.0301100049 0.3156819
## t-SC-m-t-CS-s  0.0156028737 -0.002844375  0.0340501221 0.1686252
## t-SC-s-t-CS-s  0.0201259586  0.001678710  0.0385732071 0.0213198
## t-SC-s-t-SC-m  0.0045230850 -0.012732736  0.0217789058 0.9933333
plot(posthoc1,las=1,cex.axis=0.5,col="brown")

Difference between classes in BDNF_N level are significant.

How to interpret this plot (in naive but right way)? If bar is not in contact with ‘0.0’ line, it means that difference is big enough to be significant.

Let’s check if our results are satisfied to the conditions of applicability of ANOVA.

There are 4 conditions:

-normal distribution of residuals;

-homogenity of residuals variance;

-lack of the collinearity;

-independent measurements in each group/class.

anova_diag <- fortify(anova1)
gg_resid1 <- ggplot(data = anova_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  geom_hline(yintercept = 0)+
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")+
  xlab('Predicted') + ylab ('Standardized residual') + theme_bw() + ggtitle('Standardized residuals distribution plot')
gg_resid1
## `geom_smooth()` using formula 'y ~ x'

We found no pattern of distribution of residuals.

More fancy plot to display residuals distribution:

ggplot(anova_diag, aes(class, .stdresid, fill = class)) + geom_boxplot() +
         xlab('Class') + ylab ('Standardized residuals') + ggtitle ('Standardized residuals distribution plot') + 
         theme_bw() + scale_fill_brewer(palette="Dark2")

qqPlot(anova_diag$.stdresid)

## [1] 182 918

The distribution is almost normal (except of the upper part of the curve). I think ANOVA can handle this, our results can be trusted.

2.2. BONUS: Kraskal-Wallis test & Dunn test

For all fastidious statisticians striving for perfection, I also made a Kruskal-Wallis test with Dunn test for pairwise comparisons.

kruskal_anova <-data_mice %>% kruskal_test(BDNF_N~class)

pwc_nonpar <-data_mice %>% dunn_test(BDNF_N~class, p.adjust.method = "hochberg")
pwc_nonpar
## # A tibble: 28 x 9
##    .y.    group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
##  * <chr>  <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
##  1 BDNF_N c-CS-m c-CS-s   150   135     0.145 8.85e- 1 8.85e- 1 ns          
##  2 BDNF_N c-CS-m c-SC-m   150   150    -8.65  4.97e-18 1.39e-16 ****        
##  3 BDNF_N c-CS-m c-SC-s   150   135    -4.21  2.59e- 5 4.66e- 4 ***         
##  4 BDNF_N c-CS-m t-CS-m   150   135    -4.53  6.00e- 6 1.26e- 4 ***         
##  5 BDNF_N c-CS-m t-CS-s   150   105    -6.02  1.79e- 9 4.31e- 8 ****        
##  6 BDNF_N c-CS-m t-SC-m   150   135    -3.03  2.41e- 3 3.13e- 2 *           
##  7 BDNF_N c-CS-m t-SC-s   150   135    -2.29  2.21e- 2 2.43e- 1 ns          
##  8 BDNF_N c-CS-s c-SC-m   135   150    -8.57  1.05e-17 2.84e-16 ****        
##  9 BDNF_N c-CS-s c-SC-s   135   135    -4.24  2.22e- 5 4.44e- 4 ***         
## 10 BDNF_N c-CS-s t-CS-m   135   135    -4.55  5.29e- 6 1.16e- 4 ***         
## # … with 18 more rows
pwc_nonpar <- pwc_nonpar %>% add_xy_position(x = "class")
ggboxplot(data_mice, x = "class", y = "BDNF_N") +
  stat_pvalue_manual(pwc_nonpar, hide.ns = TRUE) +
  labs(
    subtitle = get_test_label(kruskal_anova, detailed = TRUE),
    caption = get_pwc_label(pwc_nonpar)
  )

Results are the same.

3. Linear model for ERBB4_N

3.1. Full model and initial corrections

The first step would be a building of full model which contains all of the predictors:

fit1 <- lm(ERBB4_N~.-Mouse_ID-Experiment-Genotype-Treatment-class-Behavior,data=data_mice)
summary(fit1)
## 
## Call:
## lm(formula = ERBB4_N ~ . - Mouse_ID - Experiment - Genotype - 
##     Treatment - class - Behavior, data = data_mice)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.025516 -0.003834 -0.000075  0.003593  0.038233 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.0274958  0.0057420   4.789 1.93e-06 ***
## DYRK1A_N        -0.0077591  0.0054504  -1.424 0.154877    
## ITSN1_N          0.0121514  0.0059337   2.048 0.040833 *  
## BDNF_N          -0.0018158  0.0142494  -0.127 0.898624    
## NR1_N           -0.0066853  0.0038133  -1.753 0.079882 .  
## NR2A_N           0.0002810  0.0009857   0.285 0.775635    
## pAKT_N           0.0290597  0.0144957   2.005 0.045262 *  
## pBRAF_N         -0.0488960  0.0222090  -2.202 0.027919 *  
## pCAMKII_N       -0.0001033  0.0004580  -0.225 0.821664    
## pCREB_N         -0.0430923  0.0201837  -2.135 0.033002 *  
## pELK_N          -0.0007096  0.0009696  -0.732 0.464427    
## pERK_N          -0.0023965  0.0026422  -0.907 0.364612    
## pJNK_N          -0.0380089  0.0167366  -2.271 0.023358 *  
## PKCA_N           0.0414818  0.0156256   2.655 0.008063 ** 
## pMEK_N          -0.0075635  0.0162765  -0.465 0.642256    
## pNR1_N          -0.0201548  0.0089399  -2.254 0.024380 *  
## pNR2A_N          0.0053618  0.0044320   1.210 0.226642    
## pNR2B_N          0.0033623  0.0038226   0.880 0.379295    
## pPKCAB_N         0.0019429  0.0019169   1.014 0.311055    
## pRSK_N           0.0155977  0.0088832   1.756 0.079416 .  
## AKT_N            0.0118477  0.0060490   1.959 0.050435 .  
## BRAF_N          -0.0104543  0.0067836  -1.541 0.123604    
## CAMKII_N         0.0243916  0.0133265   1.830 0.067501 .  
## CREB_N          -0.0282832  0.0201232  -1.406 0.160182    
## ELK_N            0.0022419  0.0026323   0.852 0.394574    
## ERK_N            0.0041749  0.0016293   2.562 0.010543 *  
## GSK3B_N          0.0025504  0.0040383   0.632 0.527820    
## JNK_N            0.0037357  0.0209840   0.178 0.858738    
## MEK_N            0.0063418  0.0140818   0.450 0.652554    
## TRKA_N           0.0015545  0.0086639   0.179 0.857645    
## RSK_N            0.0373716  0.0250948   1.489 0.136744    
## APP_N           -0.0001430  0.0083580  -0.017 0.986350    
## Bcatenin_N       0.0010572  0.0028745   0.368 0.713102    
## SOD1_N          -0.0034438  0.0020719  -1.662 0.096798 .  
## MTOR_N           0.0379793  0.0119760   3.171 0.001564 ** 
## P38_N            0.0004134  0.0087135   0.047 0.962165    
## pMTOR_N         -0.0141851  0.0057399  -2.471 0.013627 *  
## DSCR1_N          0.0014968  0.0061796   0.242 0.808664    
## AMPKA_N         -0.0219883  0.0153626  -1.431 0.152661    
## NR2B_N           0.0035785  0.0067541   0.530 0.596352    
## pNUMB_N          0.0088084  0.0128920   0.683 0.494610    
## RAPTOR_N         0.0134691  0.0166112   0.811 0.417646    
## TIAM1_N         -0.0177624  0.0121355  -1.464 0.143596    
## pP70S6_N         0.0020408  0.0038899   0.525 0.599957    
## NUMB_N          -0.0428467  0.0237500  -1.804 0.071520 .  
## P70S6_N         -0.0068535  0.0035596  -1.925 0.054470 .  
## pGSK3B_N         0.0680863  0.0259059   2.628 0.008714 ** 
## pPKCG_N         -0.0072380  0.0013190  -5.488 5.16e-08 ***
## CDK5_N           0.0097937  0.0094400   1.037 0.299768    
## S6_N             0.0002482  0.0043802   0.057 0.954832    
## ADARB1_N        -0.0003681  0.0013816  -0.266 0.789971    
## AcetylH3K9_N     0.0045392  0.0041963   1.082 0.279642    
## RRP1_N          -0.0208306  0.0092596  -2.250 0.024690 *  
## BAX_N           -0.0039991  0.0238661  -0.168 0.866960    
## ARC_N            0.1436985  0.0398005   3.610 0.000321 ***
## nNOS_N           0.0138962  0.0189010   0.735 0.462383    
## Tau_N            0.0510379  0.0112052   4.555 5.89e-06 ***
## GFAP_N          -0.0535168  0.0311494  -1.718 0.086092 .  
## GluR3_N         -0.0206878  0.0125084  -1.654 0.098457 .  
## GluR4_N         -0.0218156  0.0111955  -1.949 0.051621 .  
## IL1B_N           0.0202090  0.0073967   2.732 0.006403 ** 
## P3525_N          0.0880711  0.0160245   5.496 4.92e-08 ***
## pCASP9_N         0.0126000  0.0022052   5.714 1.46e-08 ***
## PSD95_N          0.0175607  0.0021005   8.360  < 2e-16 ***
## SNCA_N           0.0199547  0.0240340   0.830 0.406583    
## Ubiquitin_N     -0.0007648  0.0033327  -0.229 0.818540    
## pGSK3B_Tyr216_N  0.0227687  0.0065117   3.497 0.000492 ***
## SHH_N            0.0085335  0.0128293   0.665 0.506103    
## BAD_N           -0.0463235  0.0159592  -2.903 0.003782 ** 
## BCL2_N           0.0122268  0.0149683   0.817 0.414210    
## pS6_N                   NA         NA      NA       NA    
## pCFOS_N         -0.0490420  0.0166173  -2.951 0.003238 ** 
## SYP_N            0.0215151  0.0065234   3.298 0.001007 ** 
## H3AcK18_N        0.0089403  0.0070752   1.264 0.206663    
## EGR1_N           0.0105255  0.0111806   0.941 0.346722    
## H3MeK4_N         0.0203880  0.0094645   2.154 0.031465 *  
## CaNA_N          -0.0065095  0.0024444  -2.663 0.007869 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.006661 on 1004 degrees of freedom
## Multiple R-squared:  0.8182, Adjusted R-squared:  0.8046 
## F-statistic: 60.25 on 75 and 1004 DF,  p-value: < 2.2e-16

Hm, 1 NA suddenly found. It’s a pS6_N, and NA means that this variable is linearly related to the other variables, thus causing perfect collinearity. This fact can also be found when making VIF estimation:

vif(fit1)
## Error in vif.default(fit1): there are aliased coefficients in the model
perf_coll_vars <- attributes(alias(fit1)$Complete)$dimnames[[1]]

We have to remove pS6_N from our model.

fit2 <- update(fit1, .~. - pS6_N)

Time to diagnostics!

3.2 Diagnostics of the full model

It;s time to do everyday’s job.

fit2_diag <- fortify(fit2)
head(fit2_diag)
##     ERBB4_N  DYRK1A_N   ITSN1_N    BDNF_N    NR1_N   NR2A_N    pAKT_N   pBRAF_N
## 1 0.1449893 0.5036439 0.7471932 0.4301753 2.816329 5.990152 0.2188300 0.1775655
## 2 0.1504709 0.5146171 0.6890635 0.4117703 2.789514 5.685038 0.2116362 0.1728170
## 3 0.1453302 0.5091831 0.7302468 0.4183088 2.687201 5.622059 0.2090109 0.1757222
## 4 0.1406558 0.4421067 0.6170762 0.3586263 2.466947 4.979503 0.2228858 0.1764626
## 5 0.1419830 0.4349402 0.6174298 0.3588022 2.365785 4.718679 0.2131059 0.1736270
## 6 0.1395645 0.4475064 0.6281758 0.3673881 2.385939 4.807635 0.2185778 0.1762334
##   pCAMKII_N   pCREB_N   pELK_N    pERK_N    pJNK_N    PKCA_N    pMEK_N
## 1  2.373744 0.2322238 1.750936 0.6879062 0.3063817 0.4026984 0.2969273
## 2  2.292150 0.2269721 1.596377 0.6950062 0.2990511 0.3859868 0.2813189
## 3  2.283337 0.2302468 1.561316 0.6773484 0.2912761 0.3810025 0.2817103
## 4  2.152301 0.2070042 1.595086 0.5832768 0.2967287 0.3770870 0.3138320
## 5  2.134014 0.1921579 1.504230 0.5509601 0.2869612 0.3635021 0.2779643
## 6  2.141282 0.1951875 1.442398 0.5663396 0.2898239 0.3638930 0.2668369
##      pNR1_N   pNR2A_N  pNR2B_N pPKCAB_N    pRSK_N     AKT_N    BRAF_N  CAMKII_N
## 1 1.0220603 0.6056726 1.877684 2.308745 0.4415994 0.8593658 0.4162891 0.3696080
## 2 0.9566759 0.5875587 1.725774 2.043037 0.4452219 0.8346593 0.4003642 0.3561775
## 3 1.0036350 0.6024488 1.731873 2.017984 0.4676679 0.8143294 0.3998469 0.3680888
## 4 0.8753903 0.5202932 1.566852 2.132754 0.4776707 0.7277046 0.3856387 0.3629700
## 5 0.8649120 0.5079898 1.480059 2.013697 0.4834161 0.6877937 0.3675305 0.3553109
## 6 0.8591209 0.5213066 1.538244 1.968275 0.4959000 0.6724022 0.3694045 0.3571717
##      CREB_N    ELK_N    ERK_N  GSK3B_N     JNK_N     MEK_N    TRKA_N     RSK_N
## 1 0.1789443 1.866358 3.685247 1.537227 0.2645263 0.3196770 0.8138665 0.1658460
## 2 0.1736797 1.761047 3.485287 1.509249 0.2557270 0.3044187 0.7805042 0.1571935
## 3 0.1739047 1.765544 3.571456 1.501244 0.2596135 0.3117467 0.7851540 0.1608954
## 4 0.1794489 1.286277 2.970137 1.419710 0.2595358 0.2792181 0.7344917 0.1622099
## 5 0.1748355 1.324695 2.896334 1.359876 0.2507050 0.2736672 0.7026991 0.1548274
## 6 0.1797285 1.227450 2.956983 1.447910 0.2508402 0.2840436 0.7043958 0.1568759
##       APP_N Bcatenin_N    SOD1_N    MTOR_N     P38_N   pMTOR_N   DSCR1_N
## 1 0.4539098   3.037621 0.3695096 0.4585385 0.3353358 0.8251920 0.5769155
## 2 0.4309403   2.921882 0.3422793 0.4235599 0.3248347 0.7617176 0.5450973
## 3 0.4231873   2.944136 0.3436962 0.4250048 0.3248517 0.7570308 0.5436197
## 4 0.4106149   2.500204 0.3445093 0.4292113 0.3301208 0.7469798 0.5467626
## 5 0.3985498   2.456560 0.3291258 0.4087552 0.3134148 0.6919565 0.5368605
## 6 0.3910472   2.467133 0.3275978 0.4044899 0.2962764 0.6744186 0.5397231
##     AMPKA_N    NR2B_N   pNUMB_N  RAPTOR_N   TIAM1_N  pP70S6_N    NUMB_N
## 1 0.4480993 0.5862714 0.3947213 0.3395706 0.4828639 0.2941698 0.1821505
## 2 0.4208761 0.5450973 0.3682546 0.3219592 0.4545193 0.2764306 0.1820863
## 3 0.4046298 0.5529941 0.3638799 0.3130859 0.4471972 0.2566482 0.1843877
## 4 0.3868603 0.5478485 0.3667707 0.3284919 0.4426497 0.3985340 0.1617677
## 5 0.3608164 0.5128240 0.3515510 0.3122063 0.4190949 0.3934470 0.1602002
## 6 0.3542143 0.5143164 0.3472241 0.3031321 0.4128243 0.3825783 0.1623303
##     P70S6_N  pGSK3B_N  pPKCG_N    CDK5_N      S6_N ADARB1_N AcetylH3K9_N
## 1 0.8427252 0.1926084 1.443091 0.2947000 0.3546045 1.339070    0.1701188
## 2 0.8476146 0.1948153 1.439460 0.2940598 0.3545483 1.306323    0.1714271
## 3 0.8561658 0.2007373 1.524364 0.3018807 0.3860868 1.279600    0.1854563
## 4 0.7602335 0.1841694 1.612382 0.2963818 0.2906795 1.198765    0.1597991
## 5 0.7681129 0.1857183 1.645807 0.2968294 0.3093450 1.206995    0.1646503
## 6 0.7796946 0.1867930 1.634615 0.2880373 0.3323671 1.123445    0.1756929
##      RRP1_N     BAX_N     ARC_N    nNOS_N     Tau_N    GFAP_N   GluR3_N
## 1 0.1591024 0.1888517 0.1063052 0.1766677 0.1251904 0.1152909 0.2280435
## 2 0.1581289 0.1845700 0.1065922 0.1783090 0.1342751 0.1182345 0.2380731
## 3 0.1486963 0.1905322 0.1083031 0.1762129 0.1325604 0.1177602 0.2448173
## 4 0.1661123 0.1853235 0.1031838 0.1638042 0.1232096 0.1174394 0.2349467
## 5 0.1606870 0.1882214 0.1047838 0.1677096 0.1368377 0.1160478 0.2555277
## 6 0.1505939 0.1838235 0.1064762 0.1748445 0.1305147 0.1152432 0.2368495
##     GluR4_N    IL1B_N   P3525_N pCASP9_N  PSD95_N    SNCA_N Ubiquitin_N
## 1 0.1427556 0.4309575 0.2475378 1.603310 2.014875 0.1082343   1.0449792
## 2 0.1420366 0.4571562 0.2576322 1.671738 2.004605 0.1097485   1.0098831
## 3 0.1424450 0.5104723 0.2553430 1.663550 2.016831 0.1081962   0.9968476
## 4 0.1450682 0.4309959 0.2511031 1.484624 1.957233 0.1198832   0.9902247
## 5 0.1408705 0.4812265 0.2517730 1.534835 2.009109 0.1195244   0.9977750
## 6 0.1364536 0.4785775 0.2444853 1.507777 2.003535 0.1206872   0.9201782
##   pGSK3B_Tyr216_N     SHH_N     BAD_N    BCL2_N   pCFOS_N     SYP_N H3AcK18_N
## 1       0.8315565 0.1888517 0.1226520 0.1325395 0.1083359 0.4270992 0.1147832
## 2       0.8492704 0.2004036 0.1166822 0.1325395 0.1043154 0.4415813 0.1119735
## 3       0.8467087 0.1936845 0.1185082 0.1325395 0.1062193 0.4357769 0.1118829
## 4       0.8332768 0.1921119 0.1327812 0.1325395 0.1112620 0.3916910 0.1304053
## 5       0.8786678 0.2056042 0.1299541 0.1325395 0.1106939 0.4341538 0.1184814
## 6       0.8436793 0.1904695 0.1315752 0.1325395 0.1094457 0.4398331 0.1166572
##      EGR1_N  H3MeK4_N   CaNA_N       .hat      .sigma      .cooksd   .fitted
## 1 0.1317900 0.1281856 1.675652 0.06057898 0.006664179 1.189185e-04 0.1474064
## 2 0.1351030 0.1311187 1.743610 0.05444041 0.006664331 7.143134e-05 0.1484818
## 3 0.1333618 0.1274311 1.926427 0.06371890 0.006663435 3.262995e-04 0.1492211
## 4 0.1474442 0.1469011 1.700563 0.04211131 0.006664598 8.129227e-06 0.1414286
## 5 0.1403143 0.1483799 1.839730 0.04532238 0.006663116 2.875387e-04 0.1463989
## 6 0.1407664 0.1421804 1.816389 0.04620239 0.006661507 6.023966e-04 0.1458891
##          .resid  .stdresid
## 1 -0.0024170821 -0.3743697
## 2  0.0019890237  0.3070683
## 3 -0.0038908789 -0.6036483
## 4 -0.0007728722 -0.1185467
## 5 -0.0044158661 -0.6784642
## 6 -0.0063245944 -0.9721740
ggplot(fit2_diag, aes(x = 1:nrow(fit2_diag), y = .cooksd)) + 
  geom_bar(stat = "identity") + 
  geom_hline(yintercept = 2, color = "red") + theme_bw() + ggtitle('Cooks distance plot')

No one is ever close to Cooks distance borderline == 2. It means we don’t have any influential observation.

gg_resid2 <-ggplot(data = fit2_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  geom_hline(yintercept = 0)+
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")+
  xlab('Predicted') + ylab ('Standardized residual') + theme_bw() + ggtitle('Standardized residuals distribution plot')
gg_resid2 
## `geom_smooth()` using formula 'y ~ x'

No patterns found (but we have some outliers).

ggplot(data_mice, aes(x = ERBB4_N, y = 1:nrow(data_mice))) + 
  geom_point() + theme_bw() +
  labs(y = 'Observation number', x = 'Predicted variable values') + ggtitle('Autocorrelations plot')

There are autocorrelations in our data since there were multiple measurements for each mouse.

Normality check:

ggplot(data = fit2_diag, aes(x = 1:nrow(fit2_diag), y = .stdresid)) +
  geom_point() + theme_bw() +
  labs(y = 'Standardized Residual', x = 'Observation number') + ggtitle('Normality check plot')

And QQ plots:

qqPlot(fit2_diag$.stdresid)

## [1] 142 794
qqPlot(fit2_diag$.fitted)

## [1] 89 88

Finally,multicollinearity checking:

vif(fit2)
##        DYRK1A_N         ITSN1_N          BDNF_N           NR1_N          NR2A_N 
##       44.807821       54.073873       12.007796       42.532262       20.517338 
##          pAKT_N         pBRAF_N       pCAMKII_N         pCREB_N          pELK_N 
##        8.835000        8.746842        8.538556       10.495956        4.972901 
##          pERK_N          pJNK_N          PKCA_N          pMEK_N          pNR1_N 
##       20.202296       18.351865       16.156089       13.694438       26.973124 
##         pNR2A_N         pNR2B_N        pPKCAB_N          pRSK_N           AKT_N 
##       16.840731       25.987996       20.682571        8.513558       14.409174 
##          BRAF_N        CAMKII_N          CREB_N           ELK_N           ERK_N 
##       52.275059       11.803356        6.829365       18.713024       27.485327 
##         GSK3B_N           JNK_N           MEK_N          TRKA_N           RSK_N 
##       23.692339       12.271350        8.086172       26.572626       12.090341 
##           APP_N      Bcatenin_N          SOD1_N          MTOR_N           P38_N 
##        6.344710       37.550805        8.203507       14.916410       14.678849 
##         pMTOR_N         DSCR1_N         AMPKA_N          NR2B_N         pNUMB_N 
##       11.979259        9.371754       22.415257        8.605554       16.008886 
##        RAPTOR_N         TIAM1_N        pP70S6_N          NUMB_N         P70S6_N 
##       19.631134       16.178653        8.985047       11.772174        9.204233 
##        pGSK3B_N         pPKCG_N          CDK5_N            S6_N        ADARB1_N 
##        6.083727       14.154825        3.027777        8.812942        6.074277 
##    AcetylH3K9_N          RRP1_N           BAX_N           ARC_N          nNOS_N 
##       14.703563        2.121039        4.909060        7.850880        5.394238 
##           Tau_N          GFAP_N         GluR3_N         GluR4_N          IL1B_N 
##       14.539923        4.131776        4.630194        2.202969        8.958019 
##         P3525_N        pCASP9_N         PSD95_N          SNCA_N     Ubiquitin_N 
##        5.625320        7.280353        6.942966        8.191988        8.137325 
## pGSK3B_Tyr216_N           SHH_N           BAD_N          BCL2_N         pCFOS_N 
##        9.171033        3.363377        4.430391        3.191940        3.575784 
##           SYP_N       H3AcK18_N          EGR1_N        H3MeK4_N          CaNA_N 
##        4.566674        3.857973        4.264088        5.458854       14.612223

3.3 Conclusion

Unfortunately, some of conditions of LM usage have not met. Our model suffers from autocorrelation, abnormal distribution, dependecy of observations and ton of multicollinearity. One might say ‘ok boomer our model explains ~80% of variance best model af’, but using this model would be a suicide for any serious researcher.

4. PCA

Let’s do this.

pca_res <- rda(data_mice[,3:79],scale = T)

Screeplot:

screeplot(pca_res, type = "lines", bstick = TRUE, main = 'Screeplot with Broken Stick model')

Biplot with symmetrical scaling:

biplot(pca_res)

Biplot of ordinations:

biplot(pca_res,scaling = "sites", display = "sites", main = 'Ordinations biplot')

Biplot of correlations:

biplot(pca_res,scaling = "species", display = "species", main = 'Correlations biplot')

It’s barely readable because of its clustering. We aren’t able to say where are observations of different clusters.

Let’s try to colour ordinations using data from original dataset. We need to add first 3 PCs to origianl data and using class data to colour ordination plot. Let’s build 2D plot first.

data_with_PCs <- data.frame(data_mice,
                        scores(pca_res, display = "sites", choices = c(1, 2, 3), scaling = "sites"))
ggplot(data_with_PCs, aes(x = PC1, y = PC2)) + 
  geom_point(aes(color = class), alpha = 0.5) +
  ggtitle(label = 'Ordinations biplot (coloured)') + theme_bw() + scale_fill_brewer(palette="Spectral")

And 3D plot using plotly:

fig_3D_3D <- plot_ly(data_with_PCs, x = ~PC1, y = ~PC2, z = ~PC3, marker = list(size = 2), color = ~class, colors = 'BrBG')
fig_3D <- fig_3D_3D %>% add_markers()
fig_3D <- fig_3D %>% layout(scene = list(xaxis = list(title = 'PC1'),
                                   yaxis = list(title = 'PC2'),
                                   zaxis = list(title = 'PC3')))
fig_3D

Plotly is a wondeful library that allows to plot interactive plots. Okay, plots are made, and we can clearly see overlapping of classes. But how much variance can be explained by each PC? Especially with first 3 PCs we used to plot. We already plotted screeplot, but knowing the cumulative proportion would be handy.

pca_res_PC_only <- as.data.frame(summary(pca_res)$cont)
pca_res_PC_only[3, 3]
## [1] 0.5303548

So,first 3 PCs explain ~53% of variance. Let’s build more sophisticated plot dispalying impact of each PC.

pca_summary <- summary(pca_res)
pca_result <- as.data.frame(pca_summary$cont)
plot_data <- as.data.frame(t(pca_result[c("Proportion Explained"),]))
plot_data$component <- seq(1:nrow(plot_data))
ggplot(plot_data, aes( component, `Proportion Explained`)) + geom_bar(stat = "identity") + theme_bw() + xlab('PC') + ggtitle('Screeplot without Broken Stick')

And that’s all! Thank you for your attention!